{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Quantitative Finance\n",
    "\n",
    "Copyright (c) 2019 Python Charmers Pty Ltd, Australia, <https://pythoncharmers.com>. All rights reserved.\n",
    "\n",
    "<img src=\"img/python_charmers_logo.png\" width=\"300\" alt=\"Python Charmers Logo\">\n",
    "\n",
    "Published under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license. See `LICENSE.md` for details.\n",
    "\n",
    "Sponsored by Tibra Global Services, <https://tibra.com>\n",
    "\n",
    "<img src=\"img/tibra_logo.png\" width=\"300\" alt=\"Tibra Logo\">\n",
    "\n",
    "\n",
    "## Module 2.2: Modelling Techniques\n",
    "\n",
    "### 2.2.1 Data for Linear Regression Models\n",
    "\n",
    "In this module we will go through the Linear Regression models again in more detail, before reviewing ARMA and then enhancing it with ARIMA and GARCH. We will revisit the previous considered concepts from a more theoretical view over these two modules, and afterwards we will go back to the normal mixture of code and theory when we investigate ARIMA.\n",
    "\n",
    "\n",
    "### Representing Data\n",
    "\n",
    "There are four major types of data and measurements, which can become a variable in a model. However, some need to be handled differently than simply used as-is. The four major types are:\n",
    "\n",
    "* Ratio, which allows us to multiply values in the range\n",
    "* Interval, which doesn't allow multiplying, but allows averaging\n",
    "* Ordinal, which doesn't allow averaging but allows order\n",
    "* Nominal, also called Categorical, which do not have order\n",
    "\n",
    "Note that some authors use different names, conventions or types, but the general pattern is very consistent across multiple sources."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ratio Data\n",
    "\n",
    "The first type we will look at is **Ratio** data, which are standard numbers. They are measurements with a range, and those measurements have a clear distance between them.\n",
    "\n",
    "An example of an interval is a stock's price. It can be measured, and the differences between prices can be computed. For instance:\n",
    "\n",
    "* Company A's stock is valued at \\$100\n",
    "* Company B's stock is valued at \\$105\n",
    "\n",
    "We can therefore say it costs $5 more to buy a stock in Company B than it does in Company A. Further, we can say that Company B's stock is 5\\% more than Company A's stock. (If this is all seeming a bit obvious, counter examples of data that you can't do this with are coming up).\n",
    "\n",
    "A Ratio doesn't need to encapsulate all information about the samples. For instance, in the above, we don't take into consideration things such as the number of stocks in a company, the ownership proportion, or the movement in the stock price. That's fine - we are just talking about a variable, and not a full model.\n",
    "\n",
    "The next important aspect a Ratio variable has is a \"clear zero\", which makes comparisons possible of the form \"X is twice Y\". For instance, if Company C's stock was valued at $200, we can say it costs \"twice as much\" as Company A to buy. Note that ratios can be negative in some cases (not in stock prices though, although stock valuations could be negative)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Interval Data\n",
    "\n",
    "**Interval** data works a lot like Ratio data. You can compute the mean and median of both, the data is obviously ordered, and you can say \"X is more than Y\".\n",
    "\n",
    "However, Interval data doesn't have a clear zero reference, known as a true zero, which makes comparisons like \"X is twice Y\" meaningless. For instance, Temperature is interval data.\n",
    "\n",
    "If Melbourne has a temperature of 15°C, and Sydney has a temperature of 30°C, we can say that Sydney has a greater temperature than Melbourne. We **cannot** say that Sydney is twice as hot at Melbourne. \n",
    "\n",
    "While this would make intuitive sense in lay speak, it actually doesn't mean anything. The reason? The units matter. If we measure in Fahrenheit, we get Melbourne's temperature as 59°F and Sydney's as 86°F, which is no longer twice, despite the underlying facts not changing.\n",
    "\n",
    "Interval data is still very useful, and you can normally just deal with it as a ratio value. This is especially true with practical data where the data sits in a normal range.\n"
   ]
  },
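  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The unit dependence described above can be checked numerically. The following is a minimal sketch; the helper name `celsius_to_fahrenheit` is an assumption introduced here for illustration.\n",
    "\n",
    "```python\n",
    "def celsius_to_fahrenheit(celsius):\n",
    "    # Standard conversion: scale by 9/5, then shift by 32\n",
    "    return celsius * 9 / 5 + 32\n",
    "\n",
    "melbourne_c, sydney_c = 15, 30\n",
    "melbourne_f = celsius_to_fahrenheit(melbourne_c)\n",
    "sydney_f = celsius_to_fahrenheit(sydney_c)\n",
    "\n",
    "ratio_c = sydney_c / melbourne_c  # exactly 2 in Celsius\n",
    "ratio_f = sydney_f / melbourne_f  # roughly 1.46 in Fahrenheit\n",
    "print(ratio_c, round(ratio_f, 2))\n",
    "```\n",
    "\n",
    "The ratio changes with the unit because the conversion shifts the zero point, which is why \"twice as hot\" has no fixed meaning for Interval data."
   ]
  },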
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "Answer the following questions intuitively first, and then either research or compute the actual answers to confirm.\n",
    "\n",
    "1. Does the ratio of price of two stocks (X is 105% of Y) change if you convert the prices from USD to AUD before doing the comparison?\n",
    "2. Is temperature in Kelvin a Ratio or an Interval?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. The ratio should not change because they are both being multiple by a constant ratio\n",
    "2. Kelvin would be interval data. Nevermind, I was wrong, Kelvin has a clear 0."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*For solutions, see `solutions/ratio_interval.py`*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Ordinal Data\n",
    "\n",
    "As Interval data is like Ratio data with restrictions, **Ordinal** data is a \"restricted\" Interval. The data is ordered, but not really numerical in nature. \n",
    "\n",
    "An example is the Wong-Baker FACES Pain Rating Scale, used for helping children describe how much pain they are in to a doctor or nurse.\n",
    "\n",
    "<img src=\"img/Wong_Baker_Scale.sr.JPG\">\n",
    "<small>Source: https://en.wikipedia.org/wiki/File:Wong_Baker_Scale.sr.JPG</source>\n",
    "\n",
    "Note that even though you may not be able to read the text, you could still describe the level of pain you are in.\n",
    "\n",
    "The data is clearly ordered - a person is in more pain on level 6 of the scale than level 2. However, you wouldn't say they are in \"three times more pain\". Further, you could compute the average pain someone is in, but that value is ultimately meaningless - the different between a 4 and a 6 is not really the same as the difference between a 6 and an 8. Instead, it would make more sense to compute the median or mode, and say \"the patient normally answers around 6\", rather than say \"the average pain rating is 6.3\".\n",
    "\n",
    "Another example is the standard feedback \"How satisfied are you with our service?\", with values ranging from \"Very Unsatisfied\" to \"Very Satisfied\". There is a clear order - one would rather customers be Satisfied than Unsatisfied, and \"Very Satisfied\" is even better! However, there are not really numerical values you can assign clearly to this data. \n",
    "\n",
    "Note that some people *do* assign numerical numbers, and *do* compute the mean value - this does have some use, but is ultimately meaningless, for instance, in comparing two survey results."
   ]
  },
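  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough illustration of the median and mode suggestion above, the sketch below summarises a made-up list of pain ratings using Python's standard `statistics` module. The ratings are an assumption for illustration, not real patient data.\n",
    "\n",
    "```python\n",
    "from statistics import mean, median, mode\n",
    "\n",
    "# Hypothetical answers given by one patient over several visits\n",
    "ratings = [6, 6, 4, 8, 6, 10, 4]\n",
    "\n",
    "print(median(ratings))          # middle answer once sorted\n",
    "print(mode(ratings))            # most frequent answer\n",
    "print(round(mean(ratings), 1))  # computable, but hard to interpret for Ordinal data\n",
    "```\n",
    "\n",
    "The median and mode are both values the patient actually gave, whereas the mean falls between levels of the scale."
   ]
  },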
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Nominal data (Categorical)\n",
    "\n",
    "**Nominal** data, also called Categorical or Labels, are unordered categories of data. For instance, if I asked if your pet was a Dog or a Cat, there isn't a \"greater\" of these values (one could argue with this point, but not mathematically).\n",
    "\n",
    "A common type of data here is Gender. Options could include \"Female\", \"Male\" and \"Other\". They could be presented in an order, and could even get numbered labels for storing, but you can't compute the average. You *can* compute the ratio of the total (e.g. 53% Female, 46% Male, 1% Other), but it would make no sense to say the average customer is 53% Female. If anything, that sends a very different message!\n",
    "\n",
    "A two-option nominal value is also called a dichotomous variable. This could be True/False, Correct/Incorrect, Married/Not Married, and other data of the form X / not X."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Questions\n",
    "\n",
    "What type of variable is the following?\n",
    "\n",
    "1. the age of a person\n",
    "2. the time taken for a program to run\n",
    "3. Level of education obtained, out of \"University degree\", \"No Education\", \"High school completed\", \"Some schooling\".\n",
    "4. A variable noting which doctor performed the surgery.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Ratio\n",
    "2. Ratio\n",
    "3. Ordinal\n",
    "4. Nominal"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*For solutions, see `solutions/ordinal_nominal.py`*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Dealing with nominal and ordinal data types\n",
    "\n",
    "In statistics and programming, the programs we write often assume data is at least Interval, and perhaps Ratio. We freely compute the mean of data (even if that doesn't make sense) and report the result. One of the key problems here is that we often use numbers to encode variables. For instance, in our gender question, we might assign the following numbers:\n",
    "\n",
    "* Male: 0\n",
    "* Female: 1\n",
    "* Other: 2\n",
    "\n",
    "Therefore, our data looks like this: `gender = [0, 0, 1, 1, 1, 0, 2]`, indicating Male, Male, Female, Female, and so on. This means that technically we can compute `np.mean(gender)`, and that operation will work. Further, we could fit a OLS model on this data, and that will probably also show something.\n",
    "\n",
    "To better encode the gender variable, a one-hot encoding is normally recommended. This expands the data into multiple variables of the form \"Is Male?\", \"Is Female?\" and \"Is Other?\". Only, and exactly, one of these three will be 1, and the others will be always 0. This turns this into a dichotomous nominal variable, which can actually be used as a ratio!\n",
    "\n",
    "\n",
    "<div class=\"alert alert-warning\">\n",
    "    Gender is <i>nearly</i> dichotomous for the general population - most people identify as either male or female. You can do things with dichotomous variables like compute the ratio from the mean, but if you do, you will run the risk of your results being meaningless in some samples.\n",
    "</div>\n",
    "\n",
    "\n",
    "You can encode ordinal data the same way. You do lose some data doing this though, as it turns your ordinal data into nominal data."
   ]
  },
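  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The one-hot encoding of the gender variable described above can be sketched with plain NumPy. This is only one way to do it, shown under the assumption that the codes are the integers 0, 1 and 2 listed earlier; the variable names are chosen here for illustration.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Integer codes from the text: 0 = Male, 1 = Female, 2 = Other\n",
    "gender = np.array([0, 0, 1, 1, 1, 0, 2])\n",
    "\n",
    "# Row i of the identity matrix is the one-hot vector for code i\n",
    "one_hot = np.eye(3, dtype=int)[gender]\n",
    "\n",
    "# Columns are 'Is Male?', 'Is Female?' and 'Is Other?'\n",
    "print(one_hot)\n",
    "\n",
    "# The mean of each dichotomous column is the proportion in that category\n",
    "proportions = one_hot.mean(axis=0)\n",
    "print(proportions)\n",
    "```\n",
    "\n",
    "Unlike `np.mean(gender)`, the column means here have a direct interpretation: each is the share of samples in that category."
   ]
  },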
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Exercise\n",
    "\n",
    "1. Review the documentation for Scikit-learn's OrdinalEncoder and OneHotEncoder transformers. What do they do?\n",
    "\n",
    "\n",
    "#### Extended Exercise\n",
    "\n",
    "1. What is the problem with the name of Scikit-Learn's Ordinal Encoder?\n",
    "1. Load some data that contains ordinal and nominal variables (e.g. Boston house prices). Convert the ordinal and nominal variables to encoded forms using the above scikit-learn classes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*For solutions see `solutions/ordinal_encoding.py`*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Formats for data\n",
    "\n",
    "Our data is usually called $X$, specifically a 2 dimensional matrix, where the rows correspond to each sample in our dataset (like a stock data for a day, a company, or a person), and columns correspond to features, or variables (such as the closing price, the number of employees or the person's height).\n",
    "\n",
    "Therefore, value $X_{i, j}$ is the value of the $j$th variable for the $i$th sample.\n",
    "\n",
    "Normally, we use the term $n$ to describe how many samples we have, and $k$ to describe how many variables. If no other information is given, assume these values. That said, lots of other terms are used as well (for instance, $m$ is usually used if a second set of data is included for the number of samples in that second set).\n",
    "\n",
    "Therefore, our matrix $X$ has shape $n \\times k$.\n",
    "\n",
    "Next, we have our coefficients, which are $\\beta$, and this is a value for each variable, so it has shape $k$. Often, it is much more useful for this to be a 2D array of size $k \\times 1$. This allows us to compute the values $X\\beta$, which gives a $n \\times 1$ value, which is $Y$.\n",
    "\n",
    "\n",
    "#### Question\n",
    "\n",
    "If $u$ is the error term associated with each sample, what would its shape be?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "nx1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*For solutions, see `solutions/error_term_u.py`*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Samples versus Population, and Notation\n",
    "\n",
    "To date, these modules have been a little slack with terminology relating to samples versus populations. We will address that here before we continue further.\n",
    "\n",
    "If we remember our equation for our Linear Regression model, it has been presented so far like this:\n",
    "\n",
    "$Y =  X \\beta + u$\n",
    "\n",
    "In this case, we have $X$ as our independent variables in the format above. $\\beta$ is the set of *actual parameters for mapping the linear relationship*. In other words, if there was a perfect computer simulation of the world, the relationship between $X$ and $y$ would be perfectly modelled by this equation. Note that we may not know these true values - we may never know these actual values. (They may not actually exist either, but the assumptions behind a linear model assume they do.)\n",
    "\n",
    "Finally, $u$ is the error of the model, which really is just what is left over from taking the values of $\\beta$, and then computing $u = Y - \\beta X$. These are the errors, but they are the *actual population errors*, again that are quite theoretical and you may never know what these are. Further, the values of $u$ are not learned, they are the remaining error. It may seem odd that we need $u$ if $\\beta$ is perfectly defined, but note that sometimes the same values for $x$ may result in a different $y$ value. If the $\\beta$ values are fixed, then the only way we can distinguish is through $u$.\n",
    "\n",
    "The above equations deal with ground truth population data, however almost always you are working with sample data. This means that we don't have all $X$ values. This means that, at best, we can estimate the true values for $\\beta$, so we use the notation $\\hat{\\beta}$ to denote we have an estimation of $\\beta$, not the true values.\n",
    "\n",
    "As we do not know the true values for $\\beta$, the errors we get in the model are just estimated too, so we use the notation $\\hat{u}$. Then, finally, we come to our $Y$ values being estimated, so we use the notation $\\hat{Y}$ to describe them, or $\\hat{y}$ to describe the predicted value for a given $y$ value.\n",
    "\n",
    "So while OLS is aiming to get as close to the true values as possible, what we are really learning is:\n",
    "\n",
    "$\\hat{Y} = X \\hat{\\beta} + \\hat{u}$\n",
    "\n",
    "Other notation may use $\\textbf{b}$ as the learned and estimated values for the true values of $\\beta$.\n",
    "\n",
    "This confuses the notation to keep doing this, and have separate symbols for effectively the same thing. For these notebooks, we won't be too particular about the specific symbols and \"hats\" for them. Know however, that if you ever publish a journal paper in an academic journal, you'll need to ensure you have all the symbols correctly identified.\n",
    "\n",
    "At the end of the day, notation is just notation - simply describe what your variables are upfront if you do not know the normal terminology to use.\n",
    "\n",
    "#### Question\n",
    "\n",
    "Why is $\\beta$ presented after $X$ in the previous equations, and not on the left? Hint: they are matrices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "X dimensions is n x k, B dimensions is k x 1, hence needs to be that way."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*For solutions, see `solutions/dot.py`*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basics of Prediction\n",
    "\n",
    "We will finish this module with a mid-point revision of some of the basics covered so far, but specifically with a view to prediction in the Linear Regression space.\n",
    "\n",
    "The most basic, reasonable, prediction algorithm is simply to predict the mean of your sample of data. This is the best estimator if you have just one variable. i.e. if we have *just* the heights for lots of dogs, and want to do the best guess of the next dog's height, we just use the mean. This is mathematically proven as the best estimate you can get - known as the *expected value*.\n",
    "\n",
    "Our model is therefore of the form:\n",
    "\n",
    "$\\hat{y} = \\bar{x} + u$\n",
    "\n",
    "Where $\\bar{x}$ is the mean of $X$ and $u$ is the errors, the residuals. These residuals are the difference between the prediction and the actual value. You'll often hear this referred to as a measure of the \"Goodness of fit\".\n",
    "\n",
    "The sum of squared error (SSE) is a commonly used metric in comparing two models to determine which is better. It is the sum of the squared of these residuals, therefore:\n",
    "\n",
    "$SSE = \\sum{u^2}$\n",
    "\n",
    "For a single set of values, the mean is the value that minimises the SSE in a single row of data. This is why, according to this model, the mean is the best predictor if we have no other data.\n",
    "\n",
    "The goal of linear regression is to minimise the SSE when we use multiple independent variables to predict a dependent variable. This is only one evaluation metric - there are dozens of commonly used ones, and for financial data we might be more concerned with \"absolute profit\" or some other metric more closely tied to business outcomes. Use the evaluation metric that works for your application. SSE has some nice algorithmic properties, for instance, you can compute the gradient at all points, allowing for solving the OLS algorithm quickly to minimise SSE. Not all metrics are as nice.\n",
    "\n",
    "When you fit any model to data, and want to evaluate how well it does in practice, you must fit your model on one set of data, and evaluate it on another set of data. This is known as a train/test split. You fit your data on the training data, and evaluate on the testing data. A common split is simply to split randomly 1/3 of the data for testing, and the remaining 2/3 for training. The exact numbers usually don't matter too much. If you are using a time series dataset, you'll need to split by time rather than randomly. Think always about how your model will be used in practice - for price prediction, you want to be predicting tomorrow's price, not some random day in the past. Therefore, your model needs to be evaluated when it's trying to predict data in the future.\n",
    "\n",
    "\n",
    "#### Exercise\n",
    "\n",
    "An issue with train/test splits, is that you must, by definition, lose the learning power of 1/3 of your dataset (or whatever you put into the test split). To address this, use cross-validation.\n",
    "\n",
    "Review the documentation for scikit-learn's cross-validation functions at https://scikit-learn.org/stable/modules/cross_validation.html\n",
    "\n",
    "What does cross-validation do, and why is it better than a simple train/test split?\n",
    "\n",
    "Why is there an *additional* test split in the diagram?\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Multiple train test splits, can use smaller test sets depending on sample size, and average the results."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*For solutions, see `solutions/cross_validation_one.py`*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
